
Conversation

@anmyachev (Contributor) commented Sep 16, 2024

Closes #2255

Partially reusing the changes that were removed in #2142 (namely the part related to using iteration count instead of time) solves the problem of not having enough data for the "CV" column.
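As a rough illustration (not the benchmark code itself; the helper name is made up), the "CV" column is a coefficient of variation computed from the per-iteration timings, and with too few timed iterations the statistic is undefined and shows up as NaN:

```python
import statistics

def coefficient_of_variation(times_ms):
    # Hypothetical helper: CV = sample standard deviation / mean of the timings.
    # With fewer than two samples the sample standard deviation is undefined,
    # which is one way a "*-CV" column ends up as NaN.
    if len(times_ms) < 2:
        return float("nan")
    return statistics.stdev(times_ms) / statistics.mean(times_ms)

print(coefficient_of_variation([1.02, 0.98, 1.01, 0.99, 1.00]))  # ~0.016
print(coefficient_of_variation([1.02]))                          # nan
```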

Old comments

CI status:

UPD: For some reason this greatly affects the mean time. However, if I reduce warmup, the mean does not deteriorate as much.

@anmyachev marked this pull request as ready for review September 30, 2024 13:29
@anmyachev (Contributor, Author) commented Sep 30, 2024

@ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will write the exact figures here after https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 finishes:

  • Triton ADV: -5%
  • Triton DFT: -2%
  • Xetla: -8%

Are you aware of this effect where, as the number of runs increases, the average time gets noticeably worse? I don't know what to do with this slowdown, but I still think the idea of running multiple times (>3; only in that case the "*-CV" column will not be NaN) is good from the point of view of calculating the average.

cc @whitneywhtsang @etiotto

@etiotto (Contributor) commented Oct 1, 2024

> @ESI-SYD @chengjunlu the geomean diff will most likely be smaller; I will write the exact figures here after https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/11106696646/job/30855590713 finishes:
>
> • Triton ADV: -5%
> • Triton DFT: -2%
> • Xetla: -8%
>
> Are you aware of this effect where, as the number of runs increases, the average time gets noticeably worse? I don't know what to do with this slowdown, but I still think the idea of running multiple times (>3; only in that case the "*-CV" column will not be NaN) is good from the point of view of calculating the average.
>
> cc @whitneywhtsang @etiotto

I think that when the warmup runs "too many times" the GPU may start heating up and then throttle the frequency down, so when the timed runs start the performance is reduced. That means we are better off not increasing rep/warmup to the point where we see performance degradation in the benchmarks.

@etiotto (Contributor) left a comment


I do not think we should increase the number of repetitions too much. Going from 10 to 600 repetitions is a huge increase.

The kernel timing distribution should be a normal (Gaussian) curve. We only need to run the benchmark enough times to approximate a Gaussian "bell" curve. From https://www.scribbr.com/statistics/central-limit-theorem/#:~:text=By%20convention%2C%20we%20consider%20a,if%20the%20population%20is%20normal. it looks like 30 is the number of reps we should use.
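As an illustration of that rule of thumb (a standalone sketch, not tied to the benchmark harness), the spread of the sample mean shrinks roughly as 1/sqrt(n), so going well beyond ~30 reps buys little extra stability for the mean:

```python
import random
import statistics

random.seed(0)

def spread_of_means(n_reps, trials=2000):
    # Draw n_reps "kernel timings" from a skewed distribution many times and
    # measure how much the resulting sample means vary from trial to trial.
    means = [
        statistics.mean(random.expovariate(1.0) + 1.0 for _ in range(n_reps))
        for _ in range(trials)
    ]
    return statistics.stdev(means)

for n in (3, 10, 30, 100):
    print(f"reps={n:3d}  spread of mean ~ {spread_of_means(n):.3f}")
```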

v = torch.randn((Z, H, N_CTX, D_HEAD), device='xpu', dtype=dtype)
sm_scale = 0.125
quantiles = [0.5, 0.0, 1.0]
warmup, rep = 10, 600
Contributor

From 10 to 600 times? That is way too many repetitions. It will increase the time it takes to run the benchmarks too much.

Contributor Author

> From 10 to 600 times? That is way too many repetitions. It will increase the time it takes to run the benchmarks too much.

This value is measured in milliseconds and is needed for some test combinations where one run takes more than 100 ms.
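For example, under the old time-based scheme (n_repeat = max(1, int(rep / estimate_ms)), see the removed lines below), rep = 600 with a kernel that takes ~100 ms per run gives only int(600 / 100) = 6 timed iterations, while a ~1 ms kernel would be timed roughly 600 times.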

@whitneywhtsang (Contributor)

If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?

@anmyachev (Contributor, Author)

> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?

@whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.

@whitneywhtsang (Contributor)

>> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?
>
> @whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.

I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.

@etiotto (Contributor) commented Oct 7, 2024

>>> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?
>>
>> @whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.
>
> I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.

I agree. To me "rep" should really mean the number of repetitions of the kernel after warmup. I think we can diverge from upstream here and then try to upstream our changes.

Comment on lines -83 to -85
# compute number of warmup and repeat
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))
Contributor Author

There is no point in calculating the number of iterations from the expected time of one iteration, since the required number of iterations is requested by the user directly.
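A minimal sketch of the two interpretations (function names are illustrative, not the actual benchmark code):

```python
def counts_from_time_budget(warmup_ms, rep_ms, estimate_ms):
    # Old behaviour (the removed lines above): derive iteration counts from a
    # time budget in milliseconds and the estimated per-iteration time.
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))
    return n_warmup, n_repeat

def counts_from_user(warmup, rep):
    # New behaviour: warmup and rep are already iteration counts requested by
    # the caller, so there is nothing to derive from timing estimates.
    return max(1, warmup), max(1, rep)
```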

Comment on lines -193 to -195
# compute number of warmup and repeat
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))
Contributor Author

There is no point in calculating the number of iterations from the expected time of one iteration, since the required number of iterations is requested by the user directly.

Comment on lines +152 to +154
# compute warmup and repeat times
warmup_time = n_warmup * estimate_ms
rep_time = n_repeat * estimate_ms
Contributor Author

I translate the parameters into the ones that the upstream function (triton_do_bench) understands.
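Roughly, that translation looks like this (a sketch; upstream_do_bench is a stand-in for triton's time-based do_bench, whose warmup/rep arguments are time budgets in milliseconds):

```python
def do_bench_by_iterations(fn, n_warmup, n_repeat, estimate_ms, upstream_do_bench):
    # Convert user-requested iteration counts into the millisecond budgets
    # that the time-based upstream helper expects (as in the added lines above).
    warmup_time = n_warmup * estimate_ms
    rep_time = n_repeat * estimate_ms
    return upstream_do_bench(fn, warmup=warmup_time, rep=rep_time)
```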

@anmyachev changed the title from "Increase warmup and rep for FA benchmark" to "Use iteration count instead of time for parameters warmup and rep of do_bench* functions for benchmarks" Oct 14, 2024
Signed-off-by: Anatoly Myachev <[email protected]>
Signed-off-by: Anatoly Myachev <[email protected]>
@anmyachev (Contributor, Author)

>>>> If we revert #2142, then rep is the number of iterations; would the problem of NaNs in CV then be gone?
>>>
>>> @whitneywhtsang Most likely, yes. However, I made a change to make the do_bench function more similar to the one used in upstream triton. If this is not necessary, I can revert some of the changes.
>>
>> I also see the benefit of being more similar to upstream triton, but rep meaning the number of iterations is more intuitive IMO.
>
> I agree. To me "rep" should really mean the number of repetitions of the kernel after warmup. I think we can diverge from upstream here and then try to upstream our changes.

@etiotto @whitneywhtsang ready for review

@whitneywhtsang (Contributor)

Please update PR description.

@etiotto merged commit d3a8eb0 into main Oct 15, 2024
6 checks passed
@etiotto deleted the amyachev/bench-time branch October 15, 2024 14:35